Goto

Collaborating Authors

 feature description




Sharing Knowledge for Meta-learning with Feature Descriptions

Neural Information Processing Systems

Language is an important tool for humans to share knowledge. We propose a meta-learning method that shares knowledge across supervised learning tasks using feature descriptions written in natural language, which have not been used in the existing meta-learning methods. The proposed method improves the predictive performance on unseen tasks with a limited number of labeled data by meta-learning from various tasks. With the feature descriptions, we can find relationships across tasks even when their feature spaces are different. The feature descriptions are encoded using a language model pretrained with a large corpus, which enables us to incorporate human knowledge stored in the corpus into meta-learning. In our experiments, we demonstrate that the proposed method achieves better predictive performance than the existing meta-learning methods using a wide variety of real-world datasets provided by the statistical office of the EU and Japan.


Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

Kopf, Laura, Feldhus, Nils, Bykov, Kirill, Bommer, Philine Lou, Hedström, Anna, Höhne, Marina M. -C., Eberle, Oliver

arXiv.org Artificial Intelligence

Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Within the context of large language models (LLMs) for natural language processing (NLP), current automated neuron-level feature description methods face two key challenges: limited robustness and the assumption that each neuron encodes a single concept (monosemanticity), despite increasing evidence of polysemanticity. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework specifically designed to capture the complexity of features in LLMs. Unlike approaches that assign a single description per neuron, common in many automated interpretability methods in NLP, PRISM produces more nuanced descriptions that account for both monosemantic and polysemantic behavior. We apply PRISM to LLMs and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).


Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation

Chang, Wei-Chia, Chen, Yan-Ann

arXiv.org Artificial Intelligence

Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.


Circuit Insights: Towards Interpretability Beyond Activations

Golimblevskaia, Elena, Jain, Aakriti, Puri, Bruno, Ibrahim, Ammar, Samek, Wojciech, Lapuschkin, Sebastian

arXiv.org Artificial Intelligence

The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.


Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Boggust, Angie, Ren, Donghao, Assogba, Yannick, Moritz, Dominik, Satyanarayan, Arvind, Hohman, Fred

arXiv.org Artificial Intelligence

Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.



FADE: Why Bad Descriptions Happen to Good Features

Puri, Bruno, Jain, Aakriti, Golimblevskaia, Elena, Kahardipraja, Patrick, Wiegand, Thomas, Samek, Wojciech, Lapuschkin, Sebastian

arXiv.org Artificial Intelligence

Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While they may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for evaluating feature-description alignment. FADE evaluates alignment across four key metrics - Clarity, Responsiveness, Purity, and Faithfulness - and systematically quantifies the causes for the misalignment of feature and their description. We apply FADE to analyze existing open-source feature descriptions, and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs as compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE.


Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Gur-Arieh, Yoav, Mayan, Roy, Agassy, Chen, Geiger, Atticus, Geva, Mor

arXiv.org Artificial Intelligence

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model's representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary "unembedding" head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".